Abstract. We consider the learning task consisting in predicting as well
as the best function in a finite reference set $\mathcal{G}$ up to the smallest
possible additive term. If $R(g)$ denotes the generalization error of a
prediction function $g$, under reasonable assumptions on the loss function
(typically satisfied by the least square loss when the output is bounded), it
is known that the progressive mixture rule $\hat g$ satisfies

$$\mathbb{E} R(\hat g) \le \min_{g \in \mathcal{G}} R(g) + C \frac{\log |\mathcal{G}|}{n}, \tag{1}$$

where $n$ denotes the size of the training set, $\mathbb{E}$ denotes the
expectation w.r.t. the training set distribution and $C$ denotes a positive
constant. This work mainly shows that for any training set size $n$, there
exist $\epsilon > 0$, a reference set $\mathcal{G}$ and a probability
distribution generating the data such that with probability at least $\epsilon$

$$R(\hat g) \ge \min_{g \in \mathcal{G}} R(g) + c \sqrt{\frac{\log(|\mathcal{G}| \epsilon^{-1})}{n}},$$

where $c$ is a positive constant. In other words, surprisingly, for an
appropriate reference set $\mathcal{G}$, the deviation convergence rate of the
progressive mixture rule is only of order $1/\sqrt{n}$ while its expectation
convergence rate is of order $1/n$. The same conclusion holds for the
progressive indirect mixture rule. This work also emphasizes the
suboptimality of algorithms based on penalized empirical risk minimization on
$\mathcal{G}$.



1 Setup and notation 

We assume that we observe $n$ input-output pairs $Z_1 = (X_1, Y_1), \dots,
Z_n = (X_n, Y_n)$ and that each pair has been independently drawn from the same
unknown distribution $P$. The input and output spaces are denoted respectively
$\mathcal{X}$ and $\mathcal{Y}$, so that $P$ is a probability distribution on
the product space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. The quality
of a (prediction) function $g : \mathcal{X} \to \mathcal{Y}$ is measured by
the risk (or generalization error):

$$R(g) = \mathbb{E}_{(X,Y) \sim P}\, \ell[Y, g(X)],$$

where $\ell[Y, g(X)]$ denotes the loss (possibly infinite) incurred by
predicting $g(X)$ when the true output is $Y$. We work under the following
assumptions for the data space and the loss function
$\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R} \cup \{+\infty\}$.

Main assumptions. The input space is assumed to be infinite:
$|\mathcal{X}| = +\infty$. The output space is a non-trivial (i.e. infinite)
interval of $\mathbb{R}$ symmetrical w.r.t. some $a \in \mathbb{R}$: for any
$y \in \mathcal{Y}$, we have $2a - y \in \mathcal{Y}$. The loss function is

- uniformly exp-concave: there exists $\lambda > 0$ such that for any
  $y \in \mathcal{Y}$, the set $\{y' \in \mathbb{R} : \ell(y, y') < +\infty\}$
  is an interval containing $a$ on which the function
  $y' \mapsto e^{-\lambda \ell(y, y')}$ is concave,

- symmetrical: for any $y_1, y_2 \in \mathcal{Y}$,
  $\ell(y_1, y_2) = \ell(2a - y_1, 2a - y_2)$,

- admissible: for any $y, y' \in \mathcal{Y} \cap ]a; +\infty[$,
  $\ell(y, 2a - y') > \ell(y, y')$,

- well behaved at center: for any $y \in \mathcal{Y} \cap ]a; +\infty[$, the
  function $\ell_y : y' \mapsto \ell(y, y')$ is twice continuously
  differentiable on a neighborhood of $a$ and $\ell'_y(a) < 0$.

These assumptions imply that

- $\mathcal{Y}$ necessarily has one of the following forms:
  $]-\infty; +\infty[$, $[a - \zeta; a + \zeta]$ or $]a - \zeta; a + \zeta[$
  for some $\zeta > 0$.

- for any $y \in \mathcal{Y}$, from the exp-concavity assumption, the function
  $\ell_y : y' \mapsto \ell(y, y')$ is convex on the interval on which it is
  finite (see footnote 1). As a consequence, the risk $R$ is also a convex
  function (on the convex set of prediction functions for which it is finite).

The assumptions were motivated by the fact that they are satisfied in the
following settings:

- least square loss with bounded outputs:
  $\mathcal{Y} = [y_{\min}; y_{\max}]$ and $\ell(y_1, y_2) = (y_1 - y_2)^2$.
  Then we have $a = (y_{\min} + y_{\max})/2$ and may take
  $\lambda = 1/[2(y_{\max} - y_{\min})^2]$.

- entropy loss: $\mathcal{Y} = [0; 1]$ and
  $\ell(y_1, y_2) = y_1 \log(y_1/y_2) + (1 - y_1) \log[(1 - y_1)/(1 - y_2)]$.
  Note that $\ell(0, 1) = \ell(1, 0) = +\infty$. Then we have $a = 1/2$ and
  may take $\lambda = 1$.

- exponential (or AdaBoost) loss: $\mathcal{Y} = [-y_{\max}; y_{\max}]$ and
  $\ell(y_1, y_2) = e^{-y_1 y_2}$. Then we have $a = 0$ and may take
  $\lambda = e^{-y_{\max}^2}$.

- logit loss: $\mathcal{Y} = [-y_{\max}; y_{\max}]$ and
  $\ell(y_1, y_2) = \log(1 + e^{-y_1 y_2})$. Then we have $a = 0$ and may take
  $\lambda = e^{-y_{\max}^2}$.
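As an illustration of the exp-concavity assumption, the following sketch checks numerically that, for the least square loss on a bounded interval, the function $y' \mapsto e^{-\lambda \ell(y, y')}$ is concave for the value of $\lambda$ quoted above, and that concavity fails for a larger value. The grid size, the tolerance and the helper name `is_exp_concave` are assumptions made for this example only; a finite-difference check is evidence, not a proof.

```python
import math

def is_exp_concave(loss, lam, y, lo, hi, steps=400, tol=1e-12):
    """Numerically check that u -> exp(-lam * loss(y, u)) is concave on [lo, hi].

    Uses second differences on a uniform grid; `steps` and `tol` are
    illustrative choices, not values taken from the paper.
    """
    h = (hi - lo) / steps
    f = lambda u: math.exp(-lam * loss(y, u))
    for k in range(1, steps):
        u = lo + k * h
        # A concave function has non-positive second differences.
        if f(u - h) - 2.0 * f(u) + f(u + h) > tol:
            return False
    return True

def square_loss(y1, y2):
    return (y1 - y2) ** 2

y_min, y_max = -1.0, 1.0
lam = 1.0 / (2.0 * (y_max - y_min) ** 2)   # the value suggested in the text

# Concavity must hold for every true output y; three representative values here.
ok = all(is_exp_concave(square_loss, lam, y, y_min, y_max)
         for y in (y_min, 0.0, y_max))
# A four times larger lambda breaks concavity when y sits at an endpoint.
too_large = is_exp_concave(square_loss, 4.0 * lam, y_min, y_min, y_max)
```

The second check reflects the computation $\partial^2_{y'} e^{-\lambda (y - y')^2} \le 0 \iff 2\lambda (y - y')^2 \le 1$, which is why $\lambda$ has to scale like the inverse squared range of the outputs.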

Progressive indirect mixture rule. Let $\mathcal{G}$ be a finite reference set
of prediction functions. Under the previous assumptions, the only known
algorithms satisfying (1) are the progressive indirect mixture rules defined
below.

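Footnote 1. Indeed, if $\xi$ denotes the function $e^{-\lambda \ell_y}$, from
Jensen's inequality, for any probability distribution,
$\mathbb{E}\, \ell_y(Y) = \mathbb{E}\big(-\tfrac{1}{\lambda} \log \xi(Y)\big)
\ge -\tfrac{1}{\lambda} \log \mathbb{E}\, \xi(Y)
\ge -\tfrac{1}{\lambda} \log \xi(\mathbb{E} Y) = \ell_y(\mathbb{E} Y)$.

For any $i \in \{0, \dots, n\}$, the cumulative loss suffered by the prediction
function $g$ on the first $i$ input-output pairs is

$$\Sigma_i(g) \triangleq \sum_{j=1}^{i} \ell[Y_j, g(X_j)],$$

where by convention we take $\Sigma_0 \equiv 0$. Let $\pi$ denote the uniform
distribution on $\mathcal{G}$. We define the probability distribution
$\hat\pi_i$ on $\mathcal{G}$ as

$$\hat\pi_i \propto e^{-\lambda \Sigma_i} \cdot \pi,$$

equivalently for any $g \in \mathcal{G}$,
$\hat\pi_i(g) = e^{-\lambda \Sigma_i(g)} / \big(\sum_{g' \in \mathcal{G}} e^{-\lambda \Sigma_i(g')}\big)$.
This distribution concentrates on functions having low cumulative loss up to
time $i$. For any $i \in \{0, \dots, n\}$, let $\hat h_i$ be a prediction
function such that

$$\forall (x, y) \in \mathcal{Z} \qquad \ell[y, \hat h_i(x)] \le -\frac{1}{\lambda} \log \mathbb{E}_{g \sim \hat\pi_i}\, e^{-\lambda \ell[y, g(x)]}. \tag{2}$$

The progressive indirect mixture rule produces the prediction function

$$\hat g_{\mathrm{pim}} = \frac{1}{n+1} \sum_{i=0}^{n} \hat h_i.$$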

From the uniform exp-concavity assumption and Jensen's inequality, $\hat h_i$
does exist since one may take $\hat h_i = \mathbb{E}_{g \sim \hat\pi_i}\, g$.
This particular choice leads to the progressive mixture rule, for which the
predicted output for any $x \in \mathcal{X}$ is

$$\hat g_{\mathrm{pm}}(x) = \sum_{g \in \mathcal{G}} \Big( \frac{1}{n+1} \sum_{i=0}^{n} \frac{e^{-\lambda \Sigma_i(g)}}{\sum_{g' \in \mathcal{G}} e^{-\lambda \Sigma_i(g')}} \Big)\, g(x).$$

Consequently, any result that holds for any progressive indirect mixture rule
in particular holds for the progressive mixture rule.
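The progressive mixture rule can be sketched in a few lines of code. The following is a minimal illustration, not a reference implementation: the function names `gibbs_weights` and `progressive_mixture`, the representation of the reference set as a list of Python callables, and the toy data are all assumptions made for this example. It averages the Gibbs weights computed after $0, 1, \dots, n$ observations, starting from the uniform prior.

```python
import math

def gibbs_weights(cum_losses, lam):
    """Weights proportional to exp(-lam * cumulative loss), uniform prior."""
    shift = min(cum_losses)                      # for numerical stability
    w = [math.exp(-lam * (s - shift)) for s in cum_losses]
    total = sum(w)
    return [v / total for v in w]

def progressive_mixture(functions, loss, lam, data):
    """Return the progressive mixture predictor and its averaged weights.

    functions : list of callables g(x)      (the finite reference set)
    loss      : callable loss(y, prediction)
    data      : list of (x, y) pairs        (the training set)
    """
    n = len(data)
    cum = [0.0] * len(functions)                 # cumulative losses, zero at i = 0
    avg_w = [0.0] * len(functions)
    for i in range(n + 1):
        w = gibbs_weights(cum, lam)              # Gibbs weights after i observations
        avg_w = [a + v / (n + 1) for a, v in zip(avg_w, w)]
        if i < n:                                # add the loss on the next pair
            x, y = data[i]
            cum = [c + loss(y, g(x)) for c, g in zip(cum, functions)]

    def g_pm(x):
        return sum(a * g(x) for a, g in zip(avg_w, functions))

    return g_pm, avg_w

# Toy usage: two constant predictors, square loss, outputs all equal to 1.
square = lambda y, p: (y - p) ** 2
G = [lambda x: -0.5, lambda x: 0.5]
sample = [(0.0, 1.0)] * 20
g_pm, weights = progressive_mixture(G, square, lam=0.125, data=sample)
g_empty, _ = progressive_mixture(G, square, lam=0.125, data=[])
```

With an empty sample the rule reduces to the uniform average of the reference functions; as observations accumulate, the averaged weights shift towards the function with the smaller cumulative loss.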

The idea of a progressive mean of estimators was introduced by Barron ([3])
in the context of density estimation with Kullback-Leibler loss. The form
$\hat g_{\mathrm{pm}}$ is due to Catoni ([7]). It was also independently
proposed in [4]. This procedure was studied for density estimation and least
square regression in [8,5,15,6]. Results for general losses can be found in
[12,2]. Finally, the progressive indirect mixture rule is inspired by the work
of Vovk, Haussler, Kivinen and Warmuth [13,11,14] on sequential prediction and
was studied in the "batch" setting in [2].

The symbol $C$ will denote some positive constant whose value may differ from
line to line. The logarithm in base 2 is denoted by $\log_2$ (i.e.
$\log_2 t = \log t / \log 2$) and $\lfloor x \rfloor$ denotes the largest
integer $k$ such that $k \le x$.

2 Expectation convergence rate 

First let us define the expectation convergence rate of a learning algorithm. 

Definition 1. For a given reference set $\mathcal{G}$ of prediction functions
and a set $\mathcal{P}$ of probability distributions on
$\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, a positive sequence
$(\Delta_n)_{n \ge 2}$ is said to be an expectation convergence rate of a
learning algorithm producing the prediction function $\hat g$ iff there exist
$C > c > 0$ such that

1. for any distribution $P \in \mathcal{P}$ and any $n \ge 2$, we have

$$\mathbb{E} R(\hat g) - \inf_{g \in \mathcal{G}} R(g) \le C \Delta_n \tag{3}$$

2. for large enough $n$, there exists $P \in \mathcal{P}$ for which

$$\mathbb{E} R(\hat g) - \inf_{g \in \mathcal{G}} R(g) \ge c \Delta_n.$$

We say that the rate $\Delta_n$ is optimal iff the previous item 2 is also
satisfied for any other algorithm, in other words iff there is no algorithm
having an expectation convergence rate $\tilde\Delta_n$ satisfying
$\lim_{n \to +\infty} \tilde\Delta_n / \Delta_n = 0$.

The following theorem shows that the expectation convergence rate of any
progressive indirect mixture rule is at least $(\log |\mathcal{G}|)/n$ and that
for any positive integer $d$, there exists a set $\mathcal{G}$ of $d$
prediction functions such that this rate is optimal whether we take
$\mathcal{P}$ as the set of all probability distributions on $\mathcal{Z}$ or
the set of all probability distributions on $\mathcal{Z}$ for which the output
has almost surely two symmetrical values (e.g. $\{-1; +1\}$-classification with
exponential or logit losses).

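Theorem 1. Any progressive indirect mixture rule satisfies

$$\mathbb{E} R(\hat g_{\mathrm{pim}}) \le \min_{g \in \mathcal{G}} R(g) + \frac{\log |\mathcal{G}|}{\lambda (n+1)}.$$

Let $y_1 \in \mathcal{Y} - \{a\}$ and $d$ be a positive integer. There exists a
set $\mathcal{G}$ of $d$ prediction functions such that: for any learning
algorithm, there exists a probability distribution generating the data for
which

- the output marginal is supported by $2a - y_1$ and $y_1$:
  $P(Y \in \{2a - y_1; y_1\}) = 1$,

- $\mathbb{E} R(\hat g) \ge \min_{g \in \mathcal{G}} R(g) + e^{-1} \kappa \Big(1 \wedge \frac{\lfloor \log_2 |\mathcal{G}| \rfloor}{n+1}\Big)$,
  with $\kappa \triangleq \sup_{y \in \mathcal{Y}} [\ell(y_1, a) - \ell(y_1, y)] > 0$.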



Proof. See Appendix A. 

The second part of Theorem 1 has the same $(\log |\mathcal{G}|)/n$-rate as the
lower bounds obtained in sequential prediction ([11]). From the link between
sequential prediction and our "batch" setting with i.i.d. data (see e.g.
[2, Lemma 3]), upper bounds for sequential prediction lead to upper bounds for
i.i.d. data, and lower bounds for i.i.d. data lead to lower bounds for
sequential prediction. The converse of this last assertion is not true, so
that the second part of Theorem 1 is not a consequence of the lower bounds of
[11].

The following theorem shows that for an appropriate set $\mathcal{G}$:

- the empirical risk minimizer has a $\sqrt{(\log |\mathcal{G}|)/n}$-expectation
  convergence rate.

- any empirical risk minimizer and any of its penalized variants are really
  poor algorithms in our learning task since their expectation convergence
  rate cannot be faster than $\sqrt{(\log |\mathcal{G}|)/n}$. This last point
  explains the interest we have in progressive mixture rules.

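Theorem 2. If $B \triangleq \sup_{y, y', y'' \in \mathcal{Y}} [\ell(y, y') - \ell(y, y'')] < +\infty$,
then any empirical risk minimizer, which produces a prediction function
$\hat g_{\mathrm{erm}} \in \arg\min_{g \in \mathcal{G}} \Sigma_n$, satisfies:

$$\mathbb{E} R(\hat g_{\mathrm{erm}}) \le \min_{g \in \mathcal{G}} R(g) + B \sqrt{\frac{2 \log |\mathcal{G}|}{n}}.$$

Let $y_1, \tilde y_1 \in \mathcal{Y} \cap ]a; +\infty[$ and $d$ be a positive
integer. There exists a set $\mathcal{G}$ of $d$ prediction functions such
that: for any learning algorithm producing a prediction function in
$\mathcal{G}$ (e.g. $\hat g_{\mathrm{erm}}$) there exists a probability
distribution generating the data for which

- the output marginal is supported by $2a - y_1$ and $y_1$:
  $P(Y \in \{2a - y_1; y_1\}) = 1$,

- $\mathbb{E} R(\hat g) \ge \min_{g \in \mathcal{G}} R(g) + \frac{\delta}{8} \Big(\sqrt{\frac{\lfloor \log_2 |\mathcal{G}| \rfloor}{n}} \wedge 2\Big)$,
  with $\delta \triangleq \ell(y_1, 2a - \tilde y_1) - \ell(y_1, \tilde y_1) > 0$.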

Proof. See Appendix B. 

3 Deviation convergence rate 

The efficiency of an algorithm can be summarized by its expected risk
$\mathbb{E} R(\hat g)$, but this does not describe the fluctuations of
$R(\hat g)$. In several application fields of learning algorithms, these
fluctuations play a key role: in finance for instance, the bigger the losses
can be, the more money the bank needs to freeze in order to alleviate these
possible losses. In this case, a "good" algorithm is an algorithm having not
only low expected risk but also small deviations.

The deviation convergence rate we define now is concerned with exponential
deviation inequalities (such as Hoeffding's inequality or more generally such
as standard statistical learning inequalities on the supremum of empirical
processes).

Definition 2. Let $0 < \gamma \le 1$. For a given reference set $\mathcal{G}$
of prediction functions and a set $\mathcal{P}$ of probability distributions
on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, a positive sequence
$(\Delta'_n)_{n \in \mathbb{N}}$ is said to be a deviation convergence rate of
order $\gamma$ of a learning algorithm iff there exist $C > c > 0$ such that

1. for any distribution $P \in \mathcal{P}$, integer $n \ge 2$, and
   $\epsilon > 0$, with probability at least $1 - \epsilon$ w.r.t. the training
   set distribution, we have

$$R(\hat g) - \inf_{g \in \mathcal{G}} R(g) \le C\, [\log(e \epsilon^{-1})]^{\gamma}\, \Delta'_n \tag{4}$$

2. for large enough $n$, there exist $\epsilon > 0$ and a distribution
   $P \in \mathcal{P}$ such that with probability at least $\epsilon$ w.r.t.
   the training set distribution, we have

$$R(\hat g) - \inf_{g \in \mathcal{G}} R(g) \ge c\, [\log(e \epsilon^{-1})]^{\gamma}\, \Delta'_n.$$

The following lemma shows that the expectation convergence rate of a learn- 
ing algorithm is at least of order of its deviation convergence rate. The expec- 
tation convergence rate can also be strictly faster as the comparison between 
Theorems 1 and 3 shows. 

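Lemma 1. Let $\hat g$ satisfy: for any $\epsilon > 0$, with probability at
least $1 - \epsilon$, (4) holds. Then we have

$$\mathbb{E} R(\hat g) - \inf_{g \in \mathcal{G}} R(g) \le 2^{\gamma} C \Delta'_n.$$

Proof. It suffices to integrate the deviations. Let
$R^* = \inf_{g \in \mathcal{G}} R(g)$. By Jensen's inequality, we have

$$\begin{aligned}
\Big[\mathbb{E}\, \frac{R(\hat g) - R^*}{C \Delta'_n}\Big]^{1/\gamma} - 1
&\le \mathbb{E} \Big[\Big(\frac{R(\hat g) - R^*}{C \Delta'_n}\Big)^{1/\gamma} - 1\Big] \\
&\le \mathbb{E} \max\Big\{\Big(\frac{R(\hat g) - R^*}{C \Delta'_n}\Big)^{1/\gamma} - 1,\ 0\Big\} \\
&= \int_0^{+\infty} \mathbb{P}\Big\{\Big(\frac{R(\hat g) - R^*}{C \Delta'_n}\Big)^{1/\gamma} - 1 > u\Big\}\, du \\
&= \int_0^{1} \frac{1}{\epsilon}\, \mathbb{P}\big\{R(\hat g) - R^* > C \Delta'_n [\log(e \epsilon^{-1})]^{\gamma}\big\}\, d\epsilon \qquad [\text{setting } u = \log(\epsilon^{-1})] \\
&\le 1.
\end{aligned}$$

Hence $\mathbb{E} R(\hat g) - R^* \le 2^{\gamma} C \Delta'_n$.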

The following theorem shows that the deviation convergence rate of order $1/2$
of any progressive indirect mixture rule is at least $1/\sqrt{n}$ and that
there exists $\mathcal{G}$ such that the deviation convergence rate of order
$1/2$ of any progressive indirect mixture rule is $1/\sqrt{n}$ whether we take
$\mathcal{P}$ as the set of all probability distributions on $\mathcal{Z}$ or
the set of all probability distributions on $\mathcal{Z}$ for which the output
has almost surely two symmetrical values (e.g. $\{-1; +1\}$-classification with
exponential or logit losses).

Theorem 3. If $B \triangleq \sup_{y, y', y'' \in \mathcal{Y}} [\ell(y, y') - \ell(y, y'')] < +\infty$,
then any progressive indirect mixture rule satisfies: for any $\epsilon > 0$,
with probability at least $1 - \epsilon$ w.r.t. the training set distribution,
we have

$$R(\hat g_{\mathrm{pim}}) \le \min_{g \in \mathcal{G}} R(g) + B \sqrt{\frac{2 \log(2 \epsilon^{-1})}{n+1}} + \frac{\log |\mathcal{G}|}{\lambda (n+1)}.$$

Let $y_1$ and $\tilde y_1$ in $\mathcal{Y} \cap ]a; +\infty[$ be such that
$\ell_{y_1}$ is twice continuously differentiable on $[a; \tilde y_1]$,
$\ell'_{y_1}(\tilde y_1) \le 0$ and $\ell''_{y_1}(\tilde y_1) > 0$. Consider the
prediction functions $g_1 \equiv \tilde y_1$ and $g_2 \equiv 2a - \tilde y_1$.
For any training set size $n$ large enough, there exist $\epsilon > 0$ and a
distribution generating the data such that

- the output marginal is supported by $y_1$ and $2a - y_1$,

- with probability larger than $\epsilon$, we have

$$R(\hat g_{\mathrm{pim}}) - \min_{g \in \{g_1, g_2\}} R(g) \ge c \sqrt{\frac{\log(e \epsilon^{-1})}{n}},$$

where $c$ is a positive constant depending only on the loss function, the
symmetry parameter $a$ and the output values $y_1$ and $\tilde y_1$.

Proof. See Section 4. 

This result is quite surprising since it gives an example of an algorithm which
is optimal in terms of expectation convergence rate and for which the deviation
convergence rate is (significantly) worse than the expectation convergence rate.
4 Proof of Theorem 3



4.1 Proof of the upper bound 

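We would like to thank an anonymous reviewer for suggesting the following
proof, which leads to better constants than the original one based on
PAC-Bayesian inequalities.

Let $Z_{n+1} = (X_{n+1}, Y_{n+1})$ be an input-output pair independent from the
training set $Z_1, \dots, Z_n$ and with the same distribution $P$. From the
convexity of $y' \mapsto \ell(y, y')$, we have

$$R(\hat g_{\mathrm{pim}}) \le \frac{1}{n+1} \sum_{i=0}^{n} R(\hat h_i). \tag{5}$$

Now from [16, Theorem 1] (see also [9, Proposition 1]), for any $\epsilon > 0$,
with probability at least $1 - \epsilon$, we have

$$\frac{1}{n+1} \sum_{i=0}^{n} R(\hat h_i) \le \frac{1}{n+1} \sum_{i=0}^{n} \ell\big(Y_{i+1}, \hat h_i(X_{i+1})\big) + B \sqrt{\frac{\log(\epsilon^{-1})}{2(n+1)}}. \tag{6}$$

Using [11, Theorem 3.8] and the exp-concavity assumption, we have

$$\sum_{i=0}^{n} \ell\big(Y_{i+1}, \hat h_i(X_{i+1})\big) \le \min_{g \in \mathcal{G}} \sum_{i=0}^{n} \ell\big(Y_{i+1}, g(X_{i+1})\big) + \frac{\log |\mathcal{G}|}{\lambda}. \tag{7}$$

Let $\tilde g \in \arg\min_{\mathcal{G}} R$. By Hoeffding's inequality, with
probability at least $1 - \epsilon$, we have

$$\frac{1}{n+1} \sum_{i=0}^{n} \ell\big(Y_{i+1}, \tilde g(X_{i+1})\big) \le R(\tilde g) + B \sqrt{\frac{\log(\epsilon^{-1})}{2(n+1)}}. \tag{8}$$

Merging (5), (6), (7) and (8), with probability at least $1 - 2\epsilon$, we get

$$R(\hat g_{\mathrm{pim}}) \le R(\tilde g) + B \sqrt{\frac{2 \log(\epsilon^{-1})}{n+1}} + \frac{\log |\mathcal{G}|}{\lambda (n+1)}.$$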



4.2 Proof of the lower bound 

We cannot use standard tools like Assouad's argument (see e.g. [10, Theorem 
14.6]) because if it were possible, it would mean that the lower bound would 
hold for any algorithm and this is (non trivially) false. 

To prove that any progressive indirect mixture rule has no fast exponential
deviation inequalities, we will show that on some event with not too small
probability, for most of the $i$ in $\{0, \dots, n\}$,
$\pi_{-\lambda \Sigma_i}$ concentrates on the wrong function.

The proof is organized as follows. First we define the probability distribution
for which we will prove that the progressive indirect mixture rules cannot have
fast deviation convergence rates. Then we define the event on which the
progressive indirect mixture rules do not perform well. We lower bound the
probability of this excursion event. Finally we conclude by lower bounding
$R(\hat g_{\mathrm{pim}})$ on the excursion event.

Before starting the proof, note that from the "well behaved at center" and
exp-concavity assumptions, for any $y \in \mathcal{Y} \cap ]a; +\infty[$, on a
neighborhood of $a$, we have: $\ell''_y \ge \lambda (\ell'_y)^2$ and since
$\ell'_y(a) < 0$, $y_1$ and $\tilde y_1$ exist.
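Probability distribution generating the data and first consequences.

Let $\gamma \in ]0; 1]$ be a parameter to be tuned later. We consider a
distribution generating the data such that the output distribution satisfies
for any $x \in \mathcal{X}$

$$P(Y = y_1 | X = x) = (1 + \gamma)/2 = 1 - P(Y = y_2 | X = x),$$

where $y_2 = 2a - y_1$. Let $\tilde y_2 = 2a - \tilde y_1$. From the symmetry
and admissibility assumptions, we have
$\ell(y_2, \tilde y_2) = \ell(y_1, \tilde y_1) < \ell(y_1, \tilde y_2) = \ell(y_2, \tilde y_1)$.
Introduce

$$\delta \triangleq \ell(y_1, \tilde y_2) - \ell(y_1, \tilde y_1) > 0. \tag{9}$$

We have

$$R(g_2) - R(g_1) = \frac{1+\gamma}{2} [\ell(y_1, \tilde y_2) - \ell(y_1, \tilde y_1)] + \frac{1-\gamma}{2} [\ell(y_2, \tilde y_2) - \ell(y_2, \tilde y_1)] = \gamma \delta. \tag{10}$$

Therefore $g_1$ is the best prediction function in $\{g_1, g_2\}$ for the
distribution we have chosen. Introduce
$W_j \triangleq 1_{Y_j = y_1} - 1_{Y_j = y_2}$ and
$S_i \triangleq \sum_{j=1}^{i} W_j$. For any $i \in \{1, \dots, n\}$, we have

$$\Sigma_i(g_2) - \Sigma_i(g_1) = \sum_{j=1}^{i} [\ell(Y_j, \tilde y_2) - \ell(Y_j, \tilde y_1)] = \sum_{j=1}^{i} W_j \delta = \delta S_i.$$

The weight given by the Gibbs distribution $\pi_{-\lambda \Sigma_i}$ (that is,
$\hat\pi_i$) to the function $g_1$ is

$$\pi_{-\lambda \Sigma_i}(g_1) = \frac{e^{-\lambda \Sigma_i(g_1)}}{e^{-\lambda \Sigma_i(g_1)} + e^{-\lambda \Sigma_i(g_2)}} = \frac{1}{1 + e^{\lambda [\Sigma_i(g_1) - \Sigma_i(g_2)]}} = \frac{1}{1 + e^{-\lambda \delta S_i}}. \tag{11}$$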



An excursion event on which the progressive indirect mixture rules will not
perform well. Equality (11) leads us to consider the event:

$$E_\tau = \{\forall i \in \{\tau, \dots, n\},\ S_i \le -\tau\},$$

with $\tau$ the smallest integer larger than $(\log n)/(\lambda \delta)$ such
that $n - \tau$ is even. (We could have just as well chosen $n - \tau$ odd; see
(17) below.) We have

$$\frac{\log n}{\lambda \delta} \le \tau \le \frac{\log n}{\lambda \delta} + 2. \tag{12}$$

The event $E_\tau$ can be seen as an excursion event of the random walk defined
through the random variables $W_j = 1_{Y_j = y_1} - 1_{Y_j = y_2}$,
$j \in \{1, \dots, n\}$, which are equal to $+1$ with probability
$(1 + \gamma)/2$ and $-1$ with probability $(1 - \gamma)/2$.

From (11), on the event $E_\tau$, for any $i \in \{\tau, \dots, n\}$, we have

$$\pi_{-\lambda \Sigma_i}(g_1) \le \frac{1}{n+1}. \tag{13}$$

This means that $\pi_{-\lambda \Sigma_i}$ concentrates on the wrong function,
i.e. the function $g_2$ having larger risk (see (10)).
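The step from the excursion event to the weight bound can be checked numerically. The sketch below computes the excursion threshold and evaluates the Gibbs weight of the better function at the least favourable value of the random walk allowed by the event. The helper names, the numerical values of $\lambda$ and $\delta$, and the reading of "smallest integer larger than" as a ceiling are assumptions made for this illustration.

```python
import math

def excursion_threshold(n, lam, delta):
    """Smallest integer tau >= log(n) / (lam * delta) such that n - tau is even."""
    tau = math.ceil(math.log(n) / (lam * delta))
    if (n - tau) % 2 != 0:
        tau += 1
    return tau

def gibbs_weight_g1(s_i, lam, delta):
    """Weight of the better function: 1 / (1 + exp(-lam * delta * S_i))."""
    return 1.0 / (1.0 + math.exp(-lam * delta * s_i))

lam, delta = 0.125, 2.0        # illustrative values, not taken from the paper
checks = []
for n in (10, 100, 1000, 10000):
    tau = excursion_threshold(n, lam, delta)
    x = math.log(n) / (lam * delta)
    # On the excursion event S_i <= -tau; the weight is largest at S_i = -tau.
    worst = gibbs_weight_g1(-tau, lam, delta)
    checks.append((x <= tau <= x + 2, worst <= 1.0 / (n + 1)))
```

Since the weight is increasing in $S_i$, bounding it at $S_i = -\tau$ is enough: $e^{\lambda \delta \tau} \ge n$ gives a weight of at most $1/(n+1)$.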
Lower bound of the probability of the excursion event. This requires looking
at the probability that a slightly shifted random walk in the integer space
has a very long excursion above a certain threshold. To lower bound this
probability, we will first look at the non-shifted random walk. Then we will
see that for a small enough shift parameter, probabilities of shifted random
walk events are close to the ones associated with the non-shifted random walk.

Let $N$ be a positive integer. Let $\sigma_1, \dots, \sigma_N$ be $N$
independent Rademacher variables:
$\mathbb{P}(\sigma_i = +1) = \mathbb{P}(\sigma_i = -1) = 1/2$. Let
$s_i \triangleq \sum_{j=1}^{i} \sigma_j$ be the sum of the first $i$ Rademacher
variables. We start with the following lemma for sums of Rademacher variables.

Lemma 2. Let $m$ and $t$ be positive integers. We have

$$\mathbb{P}\Big(\max_{1 \le k \le N} s_k \ge t;\ s_N \ne t;\ |s_N - t| \le m\Big) = 2\, \mathbb{P}(t < s_N \le t + m). \tag{14}$$

Proof (of Lemma 2). The result comes from the well known mirror trick used to
compute the law of $(\sup_{s \le t} W_s, W_t)$ where $W$ denotes a Brownian
motion. Consider a sequence $\sigma_1, \dots, \sigma_N$ which belongs to the
event $\mathcal{E}$ of the l.h.s. probability. Let $J$ be the first integer $j$
such that $s_j = t$. Since

- the sequences $\sigma_1, \dots, \sigma_N$ and
  $\sigma_1, \dots, \sigma_J, -\sigma_{J+1}, \dots, -\sigma_N$ have the same
  probabilities,

- both sequences belong to $\mathcal{E}$ and are different since $J < N$,

- exactly one of the sequences satisfies $s_N > t$,

we have

$$\mathbb{P}\Big(\max_{1 \le k \le N} s_k \ge t;\ s_N \ne t;\ |s_N - t| \le m\Big) = 2\, \mathbb{P}(s_N > t;\ |s_N - t| \le m),$$

which is the desired result.
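The mirror argument can be verified by brute force for small walks. The following sketch enumerates all sign sequences of a given length and counts those in the two events of the reflection identity; since every sequence has the same probability, comparing counts is equivalent to comparing probabilities. The function name and the parameter values are hypothetical choices for this example.

```python
from itertools import product

def reflection_counts(N, t, m):
    """Count +-1 sequences of length N in the two events of the reflection identity."""
    lhs = rhs = 0
    for signs in product((-1, 1), repeat=N):
        s, running_max = 0, -N
        for step in signs:
            s += step
            running_max = max(running_max, s)
        # Left-hand event: the walk reaches t, ends away from t, within distance m.
        if running_max >= t and s != t and abs(s - t) <= m:
            lhs += 1
        # Right-hand event: the walk ends strictly above t, at most at t + m.
        if t < s <= t + m:
            rhs += 1
    return lhs, rhs

lhs, rhs = reflection_counts(N=10, t=2, m=3)
```

Enumeration is exponential in the walk length, so this is only meant as a sanity check of the identity for small parameters.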

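Let $\sigma'_1, \dots, \sigma'_N$ be $N$ independent shifted Rademacher
variables to the extent that
$\mathbb{P}(\sigma'_i = +1) = (1 + \gamma)/2 = 1 - \mathbb{P}(\sigma'_i = -1)$.
These random variables satisfy the following key lemma.

Lemma 3. For any set
$A \subset \{(\epsilon_1, \dots, \epsilon_N) \in \{-1, 1\}^N : |\sum_{i=1}^{N} \epsilon_i| \le M\}$
where $M$ is a positive integer, we have

$$\mathbb{P}\{(\sigma'_1, \dots, \sigma'_N) \in A\} \ge \Big(\frac{1-\gamma}{1+\gamma}\Big)^{M/2} (1 - \gamma^2)^{N/2}\, \mathbb{P}\{(\sigma_1, \dots, \sigma_N) \in A\}. \tag{15}$$

Proof (of Lemma 3). Let $s$ be an integer such that $N - s$ is even and
$|s| \le M$. Consider a sequence $\epsilon_1, \dots, \epsilon_N$ such that
$\sum_{i=1}^{N} \epsilon_i = s$. Then the numbers of $-1$ and $+1$ in the
sequence are respectively $(N - s)/2$ and $(N + s)/2$. Consequently, we have

$$\frac{\mathbb{P}[(\sigma'_1, \dots, \sigma'_N) = (\epsilon_1, \dots, \epsilon_N)]}{\mathbb{P}[(\sigma_1, \dots, \sigma_N) = (\epsilon_1, \dots, \epsilon_N)]} = (1 - \gamma)^{(N-s)/2} (1 + \gamma)^{(N+s)/2} = (1 - \gamma^2)^{N/2} \Big(\frac{1+\gamma}{1-\gamma}\Big)^{s/2},$$

hence

$$\mathbb{P}\{(\sigma'_1, \dots, \sigma'_N) = (\epsilon_1, \dots, \epsilon_N)\} \ge (1 - \gamma^2)^{N/2} \Big(\frac{1-\gamma}{1+\gamma}\Big)^{M/2}\, \mathbb{P}\{(\sigma_1, \dots, \sigma_N) = (\epsilon_1, \dots, \epsilon_N)\}.$$

By summing over the sequences $\epsilon_1, \dots, \epsilon_N$ in $A$, we obtain
the desired result.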
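The change-of-measure bound between the shifted and non-shifted walks can also be checked by enumeration. The sketch below takes for $A$ the full set of sequences whose sum has absolute value at most $M$ and compares the probability under the shifted law with the lower bound of the lemma. The function name and the numerical values are assumptions for this example.

```python
from itertools import product

def check_shifted_bound(N, M, gamma):
    """Return (P_shifted(A), lower bound) for A = {sequences with |sum| <= M}."""
    p_up, p_down = (1 + gamma) / 2, (1 - gamma) / 2
    shifted = unshifted = 0.0
    for signs in product((-1, 1), repeat=N):
        if abs(sum(signs)) > M:
            continue
        ups = signs.count(1)
        shifted += p_up ** ups * p_down ** (N - ups)   # probability under the shifted law
        unshifted += 0.5 ** N                           # probability under the symmetric law
    factor = ((1 - gamma) / (1 + gamma)) ** (M / 2) * (1 - gamma ** 2) ** (N / 2)
    return shifted, factor * unshifted

shifted, bound = check_shifted_bound(N=8, M=2, gamma=0.3)
```

The bound is tight only for sequences whose sum equals $-M$, which is why the inequality is strict as soon as $A$ contains other sequences.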
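We may now lower bound the probability of the excursion event $E_\tau$. Let $M$
be an integer larger than $\tau$. We still use
$W_j = 1_{Y_j = y_1} - 1_{Y_j = y_2}$ for $j \in \{1, \dots, n\}$. By using
Lemma 3 with $N = n - 2\tau$, we obtain

$$\begin{aligned}
\mathbb{P}(E_\tau)
&\ge \mathbb{P}\Big(W_1 = -1, \dots, W_{2\tau} = -1;\ \forall\, 2\tau < i \le n,\ \sum_{j=2\tau+1}^{i} W_j \le \tau\Big) \\
&= \Big(\frac{1-\gamma}{2}\Big)^{2\tau}\, \mathbb{P}\Big(\forall\, i > 2\tau,\ \sum_{j=2\tau+1}^{i} W_j \le \tau\Big) \\
&= \Big(\frac{1-\gamma}{2}\Big)^{2\tau}\, \mathbb{P}\Big(\forall\, i \in \{1, \dots, N\},\ \sum_{j=1}^{i} \sigma'_j \le \tau\Big) \\
&\ge \Big(\frac{1-\gamma}{2}\Big)^{2\tau}\, \mathbb{P}\Big(\Big|\sum_{i=1}^{N} \sigma'_i\Big| \le M;\ \forall\, i \in \{1, \dots, N\},\ \sum_{j=1}^{i} \sigma'_j \le \tau\Big) \\
&\ge \Big(\frac{1-\gamma}{2}\Big)^{2\tau} \Big(\frac{1-\gamma}{1+\gamma}\Big)^{M/2} (1 - \gamma^2)^{N/2}\, \mathbb{P}\Big(|s_N| \le M;\ \max_{1 \le i \le N} s_i \le \tau\Big).
\end{aligned} \tag{16}$$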

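By using Lemma 2, since $\tau \le M$, the r.h.s. probability can be lower
bounded:

$$\begin{aligned}
\mathbb{P}\Big(|s_N| \le M;\ \max_{1 \le i \le N} s_i \le \tau\Big)
&= \mathbb{P}\Big\{\max_{1 \le i \le N} s_i \le \tau;\ s_N \ge -M\Big\} \\
&\ge \mathbb{P}\Big\{\max_{1 \le i \le N} s_i < \tau;\ |s_N - \tau| \le M + \tau;\ s_N \ne \tau\Big\} \\
&= \mathbb{P}\{|s_N - \tau| \le M + \tau;\ s_N \ne \tau\} - \mathbb{P}\Big\{\max_{1 \le i \le N} s_i \ge \tau;\ |s_N - \tau| \le M + \tau;\ s_N \ne \tau\Big\} \\
&= \mathbb{P}\{|s_N - \tau| \le M + \tau;\ s_N \ne \tau\} - 2\, \mathbb{P}\{\tau < s_N \le M + 2\tau\} \\
&= \mathbb{P}\{-M \le s_N < \tau\} - \mathbb{P}\{\tau < s_N \le M + 2\tau\} \\
&= \mathbb{P}\{-\tau < s_N \le M\} - \mathbb{P}\{\tau < s_N \le M + 2\tau\} \\
&= \mathbb{P}\{-\tau < s_N \le \tau\} - \mathbb{P}\{M < s_N \le M + 2\tau\}.
\end{aligned}$$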

Let us consider only the integers $M \ge \tau$ such that $n - M$ is even, or
equivalently $N - M$ is even. Since $N - \tau = n - 3\tau$ is also even, we have

$$\begin{aligned}
\mathbb{P}\Big(|s_N| \le M;\ \max_{1 \le i \le N} s_i \le \tau\Big)
&\ge \sum_{k=1}^{\tau} \big[\mathbb{P}(s_N = -\tau + 2k) - \mathbb{P}(s_N = M + 2k)\big] \\
&\ge \tau\, [\mathbb{P}(s_N = \tau) - \mathbb{P}(s_N = M)],
\end{aligned} \tag{17}$$

where the last inequality comes from properties of the binomial coefficients.
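Combining (16) and (17), we obtain

$$\mathbb{P}(E_\tau) \ge \tau \Big(\frac{1-\gamma}{2}\Big)^{2\tau} \Big(\frac{1-\gamma}{1+\gamma}\Big)^{M/2} (1 - \gamma^2)^{N/2}\, [\mathbb{P}(s_N = \tau) - \mathbb{P}(s_N = M)], \tag{18}$$

where we recall that $\tau$ has the order of $\log n$, $N = n - 2\tau$ has the
order of $n$ and that $\gamma > 0$ and $M \ge \tau$ have to be appropriately
chosen.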

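To control the probabilities of the r.h.s., we use Stirling's formula

$$n^n e^{-n} \sqrt{2 \pi n}\, e^{1/(12n+1)} < n! < n^n e^{-n} \sqrt{2 \pi n}\, e^{1/(12n)}, \tag{19}$$

and get for any $s \in [0; N[$ such that $N - s$ is even,

$$\begin{aligned}
\mathbb{P}(s_N = s) &= \Big(\frac{1}{2}\Big)^{N} \binom{N}{\frac{N+s}{2}} \\
&\ge \Big(\frac{1}{2}\Big)^{N} \frac{N^N e^{-N} \sqrt{2 \pi N}\, e^{\frac{1}{12N+1}}}{\big(\frac{N+s}{2}\big)^{\frac{N+s}{2}} \big(\frac{N-s}{2}\big)^{\frac{N-s}{2}} e^{-N} \sqrt{\pi (N+s)} \sqrt{\pi (N-s)}\, e^{\frac{1}{6(N+s)}} e^{\frac{1}{6(N-s)}}} \\
&= \sqrt{\frac{2}{\pi N}} \Big(1 - \frac{s^2}{N^2}\Big)^{-\frac{N+1}{2}} \Big(\frac{1 - s/N}{1 + s/N}\Big)^{\frac{s}{2}} e^{\frac{1}{12N+1} - \frac{1}{6(N+s)} - \frac{1}{6(N-s)}},
\end{aligned} \tag{20}$$

and similarly

$$\mathbb{P}(s_N = s) \le \sqrt{\frac{2}{\pi N}} \Big(1 - \frac{s^2}{N^2}\Big)^{-\frac{N+1}{2}} \Big(\frac{1 - s/N}{1 + s/N}\Big)^{\frac{s}{2}} e^{\frac{1}{12N}}. \tag{21}$$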

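These computations and (18) lead us to take $M$ as the smallest integer larger
than $\sqrt{n}$ such that $n - M$ is even. Indeed, from (12), (20) and (21), we
obtain
$\lim_{n \to +\infty} \sqrt{n}\, [\mathbb{P}(s_N = \tau) - \mathbb{P}(s_N = M)] = c$,
where $c = \sqrt{2/\pi}\, (1 - e^{-1/2}) > 0$. Therefore for $n$ large enough we
have

$$\mathbb{P}(E_\tau) \ge \frac{c \tau}{2 \sqrt{n}} \Big(\frac{1-\gamma}{2}\Big)^{2\tau} \Big(\frac{1-\gamma}{1+\gamma}\Big)^{M/2} (1 - \gamma^2)^{N/2}. \tag{22}$$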
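The limiting constant obtained from Stirling's formula can be compared with exact binomial probabilities. The following sketch evaluates the scaled gap between the two point probabilities for one large sample size, using the log-gamma function to avoid overflow. The helper names, the value chosen for $\lambda \delta$, and the reading of "smallest integer larger than" as a ceiling are assumptions made for this illustration.

```python
import math

def prob_walk_at(N, s):
    """P(s_N = s) for a symmetric +-1 walk, computed through log-gamma."""
    if (N - s) % 2 != 0 or abs(s) > N:
        return 0.0
    k = (N + s) // 2                             # number of +1 steps
    log_p = (math.lgamma(N + 1) - math.lgamma(k + 1)
             - math.lgamma(N - k + 1) - N * math.log(2.0))
    return math.exp(log_p)

def smallest_with_parity(lower, n):
    """Smallest integer v >= lower such that n - v is even."""
    v = math.ceil(lower)
    return v if (n - v) % 2 == 0 else v + 1

def scaled_gap(n, lam_delta):
    """sqrt(n) * [P(s_N = tau) - P(s_N = M)] with tau, M, N chosen as in the text."""
    tau = smallest_with_parity(math.log(n) / lam_delta, n)
    M = smallest_with_parity(math.sqrt(n), n)
    N = n - 2 * tau
    return math.sqrt(n) * (prob_walk_at(N, tau) - prob_walk_at(N, M))

limit_c = math.sqrt(2.0 / math.pi) * (1.0 - math.exp(-0.5))
gap = scaled_gap(10 ** 6, lam_delta=1.0)
```

The value $e^{-1/2}$ in the constant comes from $M^2/(2N) \to 1/2$, while $\tau^2/(2N) \to 0$ because $\tau$ only grows logarithmically.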

The last two terms of the r.h.s. of (22) lead us to take $\gamma$ of order
$1/\sqrt{n}$ up to possibly a logarithmic term. We obtain the following lower
bound on the excursion probability.

Lemma 4. If $\gamma = \sqrt{C_0 (\log n)/n}$ with $C_0$ a positive constant,
then for any large enough $n$,

$$\mathbb{P}(E_\tau) \ge \dots$$

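Behavior of the progressive indirect mixture rule on the excursion event.
From now on, we work on the event $E_\tau$. We have
$\hat g_{\mathrm{pim}} = (\sum_{i=0}^{n} \hat h_i)/(n+1)$. We still use
$\delta = \ell(y_1, \tilde y_2) - \ell(y_1, \tilde y_1) = \ell(y_2, \tilde y_1) - \ell(y_2, \tilde y_2)$.
On the event $E_\tau$, for any $x \in \mathcal{X}$ and any
$i \in \{\tau, \dots, n\}$, by definition of $\hat h_i$, we have

$$\begin{aligned}
\ell[y_2, \hat h_i(x)] - \ell(y_2, \tilde y_2)
&\le -\frac{1}{\lambda} \log \mathbb{E}_{g \sim \pi_{-\lambda \Sigma_i}}\, e^{-\lambda \{\ell[y_2, g(x)] - \ell(y_2, \tilde y_2)\}} \\
&= -\frac{1}{\lambda} \log \big\{ \pi_{-\lambda \Sigma_i}(g_1)\, e^{-\lambda \delta} + \pi_{-\lambda \Sigma_i}(g_2) \big\} \\
&= -\frac{1}{\lambda} \log \big\{ e^{-\lambda \delta} + (1 - e^{-\lambda \delta})\, \pi_{-\lambda \Sigma_i}(g_2) \big\} \\
&\le -\frac{1}{\lambda} \log \Big\{ 1 - (1 - e^{-\lambda \delta}) \frac{1}{n+1} \Big\}.
\end{aligned}$$

In particular, for any $n$ large enough, we have
$\ell[y_2, \hat h_i(x)] - \ell(y_2, \tilde y_2) \le C n^{-1}$, with $C > 0$
independent from $\gamma$. From the convexity of the function
$y \mapsto \ell(y_2, y)$ and by Jensen's inequality, we obtain

$$\begin{aligned}
\ell[y_2, \hat g_{\mathrm{pim}}(x)] - \ell(y_2, \tilde y_2)
&= \ell\Big[y_2, \frac{1}{n+1} \sum_{i=0}^{n} \hat h_i(x)\Big] - \ell(y_2, \tilde y_2) \\
&\le \frac{1}{n+1} \sum_{i=0}^{n} \ell[y_2, \hat h_i(x)] - \ell(y_2, \tilde y_2) \\
&\le \frac{\tau \delta}{n+1} + \frac{C}{n} \\
&\le C_1 \frac{\log n}{n}
\end{aligned} \tag{23}$$

for some constant $C_1 > 0$ independent from $\gamma$, where the third line
bounds each term with $i < \tau$ by $\delta$. Let us now prove that for $n$
large enough, we have

$$\tilde y_2 \le \hat g_{\mathrm{pim}}(x) \le \tilde y_2 + C \sqrt{\frac{\log n}{n}} \le \tilde y_1, \tag{24}$$

with $C > 0$ independent from $\gamma$.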

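Proof. For any $y \in \mathcal{Y}$, let $\bar y = 2a - y$. We have
$\ell(y_2, y) - \ell(y_2, \tilde y_2) = \ell_{y_1}(\bar y) - \ell_{y_1}(\tilde y_1)$.
Since $\ell'_{y_1}(\tilde y_1) \le 0$, $\ell''_{y_1}(\tilde y_1) > 0$,
$\ell''_{y_1} \ge \lambda (\ell'_{y_1})^2$ and $\ell''_{y_1}$ is continuous on
$[a; \tilde y_1]$, there exists $m > 0$ such that $\ell''_{y_1} \ge m$ on
$[a; \tilde y_1]$. For any $\tilde y_2 \le y \le a$, from Taylor's expansion,
we have

$$\begin{aligned}
\ell(y_2, y) - \ell(y_2, \tilde y_2)
&\ge (\bar y - \tilde y_1)\, \ell'_{y_1}(\tilde y_1) + \frac{(\bar y - \tilde y_1)^2}{2}\, m \\
&\ge \frac{(\bar y - \tilde y_1)^2}{2}\, m \\
&= \frac{(y - \tilde y_2)^2}{2}\, m.
\end{aligned} \tag{25}$$

Let $y_0 = \tilde y_2 + \sqrt{\frac{2 C_1 \log n}{m\, n}}$, where $C_1$ is the
constant appearing in (23). For $n$ large enough, we have $y_0 \le a$ and we
may apply (25) to $y = y_0$. We get

$$\ell(y_2, y_0) - \ell(y_2, \tilde y_2) \ge C_1 \frac{\log n}{n}. \tag{26}$$

Since $\ell_{y_1}$ is convex, $\ell'_{y_1}(\tilde y_1) \le 0$ and
$\ell''_{y_1}(\tilde y_1) > 0$, the function $\ell_{y_1}$ decreases on
$]-\infty; \tilde y_1] \cap \mathcal{Y}$. By symmetry, the function
$y \mapsto \ell(y_2, y)$ is non-decreasing on
$[\tilde y_2; +\infty[ \cap \mathcal{Y}$. From (23) and (26), we get
$\hat g_{\mathrm{pim}}(x) \notin [y_0; +\infty[$, which ends the proof of the
upper bound of $\hat g_{\mathrm{pim}}(x)$.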

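For the lower bound, for any $x \in \mathcal{X}$, by definition of $\hat h_i$,
we have

$$\begin{aligned}
\ell[y_1, \hat h_i(x)] - \ell(y_1, \tilde y_1)
&\le -\frac{1}{\lambda} \log \mathbb{E}_{g \sim \pi_{-\lambda \Sigma_i}}\, e^{-\lambda \{\ell[y_1, g(x)] - \ell(y_1, \tilde y_1)\}} \\
&= -\frac{1}{\lambda} \log \big\{ \pi_{-\lambda \Sigma_i}(g_1) + \pi_{-\lambda \Sigma_i}(g_2)\, e^{-\lambda \delta} \big\} \\
&\le \delta.
\end{aligned}$$

By Jensen's inequality, we obtain

$$\begin{aligned}
\ell_{y_1}[\hat g_{\mathrm{pim}}(x)] - \ell_{y_1}(\tilde y_1)
&= \ell\Big[y_1, \frac{1}{n+1} \sum_{i=0}^{n} \hat h_i(x)\Big] - \ell(y_1, \tilde y_1) \\
&\le \frac{1}{n+1} \sum_{i=0}^{n} \ell[y_1, \hat h_i(x)] - \ell(y_1, \tilde y_1) \\
&\le \delta = \ell_{y_1}(\tilde y_2) - \ell_{y_1}(\tilde y_1).
\end{aligned}$$

Since the function $\ell_{y_1}$ decreases on
$]-\infty; \tilde y_2] \cap \mathcal{Y}$, we get that
$\hat g_{\mathrm{pim}}(x) \ge \tilde y_2$, which ends the proof of (24).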

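From (24), we obtain

$$\begin{aligned}
R(\hat g_{\mathrm{pim}}) - R(g_1)
&= \frac{1+\gamma}{2} [\ell(y_1, \hat g_{\mathrm{pim}}) - \ell(y_1, \tilde y_1)] + \frac{1-\gamma}{2} [\ell(y_2, \hat g_{\mathrm{pim}}) - \ell(y_2, \tilde y_1)] \\
&= \frac{1+\gamma}{2} [\ell_{y_1}(\hat g_{\mathrm{pim}}) - \ell_{y_1}(\tilde y_1)] + \frac{1-\gamma}{2} [\ell_{y_1}(2a - \hat g_{\mathrm{pim}}) - \ell_{y_1}(\tilde y_2)] \\
&= \frac{1+\gamma}{2} [\delta + \ell_{y_1}(\hat g_{\mathrm{pim}}) - \ell_{y_1}(\tilde y_2)] + \frac{1-\gamma}{2} [-\delta + \ell_{y_1}(2a - \hat g_{\mathrm{pim}}) - \ell_{y_1}(\tilde y_1)] \\
&\ge \gamma \delta - (\hat g_{\mathrm{pim}} - \tilde y_2)\, |\ell'_{y_1}(\tilde y_2)| \\
&\ge \gamma \delta - C_2 \sqrt{\frac{\log n}{n}},
\end{aligned} \tag{27}$$

with $C_2$ independent from $\gamma$. We may take
$\gamma = \frac{2 C_2}{\delta} \sqrt{(\log n)/n}$ and obtain: for $n$ large
enough, on the event $E_\tau$, we have
$R(\hat g_{\mathrm{pim}}) - R(g_1) \ge C \sqrt{(\log n)/n}$. From Lemma 4, this
inequality holds with probability at least $1/n^{C_4}$ for some $C_4 > 0$. To
conclude, for any $n$ large enough, there exists $\epsilon > 0$ s.t. with
probability at least $\epsilon$,

$$R(\hat g_{\mathrm{pim}}) - R(g_1) \ge c \sqrt{\frac{\log(e \epsilon^{-1})}{n}},$$

where $c$ is a positive constant depending only on the loss function, the
symmetry parameter $a$ and the output values $y_1$ and $\tilde y_1$.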

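Remark 1. Had we considered the progressive mixture rule, this last part of
the proof would have been much simpler. Indeed, for $n$ large enough, on the
event $E_\tau$, from (13), we have

$$p \triangleq \frac{1}{n+1} \sum_{i=0}^{n} \pi_{-\lambda \Sigma_i}(g_1) \le \frac{\tau}{n+1} + \sup_{\tau \le i \le n} \pi_{-\lambda \Sigma_i}(g_1) \le C \frac{\log n}{n}$$

and
$\hat g_{\mathrm{pm}} = \frac{1}{n+1} \sum_{i=0}^{n} \mathbb{E}_{g \sim \pi_{-\lambda \Sigma_i}}\, g = p\, g_1 + (1 - p)\, g_2 = \tilde y_2 + p (\tilde y_1 - \tilde y_2)$.
So we have

$$\tilde y_2 \le \hat g_{\mathrm{pm}} \le \tilde y_2 + C \frac{\log n}{n} \le \tilde y_1,$$

which is much stronger than (24) (and much simpler to prove).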
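The remark can be illustrated on one explicit trajectory of the random walk that belongs to the excursion event. The sketch below uses the square loss on $[-1, 1]$ with hypothetical output values, builds a path that first goes down for $2\tau$ steps and then alternates, and computes the averaged Gibbs weight $p$ together with the resulting constant prediction of the progressive mixture rule. All numerical values and names are assumptions for this example; the path is one particular element of the event, not a sample from the data distribution.

```python
import math

def averaged_weight_on_g1(walk, lam, delta):
    """p = (1/(n+1)) * sum over i = 0..n of 1 / (1 + exp(-lam * delta * S_i))."""
    n = len(walk)
    s, total = 0, 0.5                 # i = 0: uniform weights on the two functions
    for w in walk:
        s += w
        total += 1.0 / (1.0 + math.exp(-lam * delta * s))
    return total / (n + 1)

# Illustrative setting: square loss on [-1, 1], y_1 = 1, tilde-y_1 = 0.5.
lam = 1.0 / 8.0                       # 1 / [2 (y_max - y_min)^2]
y1, ty1 = 1.0, 0.5
ty2 = -ty1                            # symmetric point, a = 0
delta = (y1 - ty2) ** 2 - (y1 - ty1) ** 2

n = 2000
tau = math.ceil(math.log(n) / (lam * delta))
if (n - tau) % 2:
    tau += 1
# A path in the excursion event: 2*tau steps down, then alternating up/down steps.
walk = [-1] * (2 * tau) + [1 if j % 2 == 0 else -1 for j in range(n - 2 * tau)]

p = averaged_weight_on_g1(walk, lam, delta)
g_pm = ty2 + p * (ty1 - ty2)          # prediction of the rule, constant in x here
```

On such a path the prediction stays within a distance of order $(\log n)/n$ of the worse constant $\tilde y_2$, which is the mechanism behind the lower bound.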



A Proof of Theorem 1 

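The first assertion is a direct consequence of Lemma 3.3 and Corollary 4.1 of
[2]. The second assertion is based on an Assouad's type lower bound
([1, Inequality (8.19)]). Let $y_2 = 2a - y_1$ and
$m = \lfloor \log_2 |\mathcal{G}| \rfloor$. We use the notation introduced in
[1, Section 8.1]. We consider a
$\big(m, \frac{1}{n+1} \wedge \frac{1}{m}, 1\big)$-hypercube of probability
distributions with $h_1 \equiv \arg\min_{y \in \mathcal{Y}} \ell_{y_1}(y)$ and
$h_2 \equiv \arg\min_{y \in \mathcal{Y}} \ell_{y_2}(y)$. We obtain

$$\begin{aligned}
\mathbb{E} R(\hat g) - \min_{g \in \mathcal{G}} R(g)
&\ge \Big(\frac{m}{n+1} \wedge 1\Big)\, d_I\, \Big(1 - \frac{1}{n+1} \wedge \frac{1}{m}\Big)^{n} \\
&\ge d_I \Big(\frac{m}{n+1} \wedge 1\Big) e^{-1},
\end{aligned}$$

where the last inequality comes from $[1 - 1/(n+1)]^n \searrow e^{-1}$. Now the
edge discrepancy $d_I$ can be computed:

$$\begin{aligned}
d_I &= \psi_{1, 0, y_1, y_2}(1/2) \\
&= \phi_{y_1, y_2}(1/2) - \frac{1}{2} \inf_{y \in \mathcal{Y}} \ell(y_1, y) - \frac{1}{2} \inf_{y \in \mathcal{Y}} \ell(y_2, y) \\
&= \inf_{y \in \mathcal{Y}} \frac{\ell(y_1, y) + \ell(y_1, 2a - y)}{2} - \inf_{y \in \mathcal{Y}} \ell(y_1, y) \\
&= \sup_{y \in \mathcal{Y}} [\ell(y_1, a) - \ell(y_1, y)],
\end{aligned}$$

where the last equality uses that the function
$y \mapsto [\ell(y_1, y) + \ell(y_1, 2a - y)]/2$ is convex and symmetric with
respect to $a$, hence minimal at $a$. Finally, from the "well behaved at
center" assumption, the supremum is positive.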






B Proof of Theorem 2 

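Let $\tilde g \in \arg\min_{\mathcal{G}} R$ and $\eta > 0$. Hoeffding's
inequality applied to the random variable
$W = \ell[Y, g(X)] - \ell[Y, \tilde g(X)] \in [-B; B]$ for a fixed
$g \in \mathcal{G}$ gives

$$\mathbb{E}\, e^{\eta [W - \mathbb{E} W]} \le e^{\eta^2 B^2 / 2}$$

for any $\eta > 0$. Since the random variables $Z_1, \dots, Z_n$ are
independent, we obtain

$$\mathbb{E}\, e^{\eta [n R(g) - n R(\tilde g) + \Sigma_n(\tilde g) - \Sigma_n(g)]} \le e^{\eta^2 n B^2 / 2}.$$

Consequently we have

$$\begin{aligned}
n \{\mathbb{E} R(\hat g_{\mathrm{erm}}) - R(\tilde g)\}
&\le \mathbb{E} \{n R(\hat g_{\mathrm{erm}}) - n R(\tilde g) + \Sigma_n(\tilde g) - \Sigma_n(\hat g_{\mathrm{erm}})\} \\
&\le \frac{1}{\eta} \log \mathbb{E}\, e^{\eta [n R(\hat g_{\mathrm{erm}}) - n R(\tilde g) + \Sigma_n(\tilde g) - \Sigma_n(\hat g_{\mathrm{erm}})]} \\
&\le \frac{1}{\eta} \log \mathbb{E} \sum_{g \in \mathcal{G}} e^{\eta [n R(g) - n R(\tilde g) + \Sigma_n(\tilde g) - \Sigma_n(g)]} \\
&\le \frac{1}{\eta} \log \big(|\mathcal{G}|\, e^{\eta^2 n B^2 / 2}\big).
\end{aligned}$$

The first assertion follows from the (optimal) choice
$\eta = \sqrt{(2 \log |\mathcal{G}|)/(n B^2)}$.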
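The optimization over $\eta$ is a one-line computation that can be double-checked numerically. The sketch below evaluates the bound $\frac{1}{n \eta} \log(|\mathcal{G}|\, e^{\eta^2 n B^2/2})$ on the excess risk at the stated choice of $\eta$ and compares it with the closed form $B \sqrt{2 \log |\mathcal{G}| / n}$. The function name and the numerical values are assumptions for this illustration.

```python
import math

def erm_excess_bound(eta, n, B, card):
    """(1 / (n * eta)) * log(card * exp(eta^2 * n * B^2 / 2))."""
    return (math.log(card) + eta ** 2 * n * B ** 2 / 2.0) / (n * eta)

n, B, card = 500, 1.0, 16                  # illustrative values
eta_star = math.sqrt(2.0 * math.log(card) / (n * B ** 2))
best = erm_excess_bound(eta_star, n, B, card)
closed_form = B * math.sqrt(2.0 * math.log(card) / n)
```

The bound has the form $u/\eta + v \eta$ with $u = (\log |\mathcal{G}|)/n$ and $v = B^2/2$, so it is minimized at $\eta = \sqrt{u/v}$ with minimal value $2\sqrt{uv}$.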

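The second assertion is based on an Assouad's type lower bound. Let
$y_2 = 2a - y_1$ and $m = \lfloor \log_2 |\mathcal{G}| \rfloor$. We use the
notation introduced in [1, Section 8.1]. We consider a
$(m, 1/m, d_{II})$-hypercube of probability distributions with
$h_1 \equiv \tilde y_1$ and $h_2 \equiv \tilde y_2 = 2a - \tilde y_1$, where
$d_{II}$ has to be optimized in $[0; 1]$. In the proof of Theorem 1, we take
the set $\mathcal{G}$ such that $\min_{g \in \mathcal{G}} R(g) = \min_g R(g)$,
where the second minimum is w.r.t. all possible prediction functions. Here the
trick is to realize that $\min_{g \in \mathcal{G}} R(g)$ for our learning
setting is equal to $\min_g R(g)$ for the learning task in which the output
space is only $\{\tilde y_1, \tilde y_2\}$. Therefore we apply
[1, Inequality (8.17)] with the function $\phi$ appearing in the edge
discrepancy $d_I$ defined as
$\phi_{y_1, y_2}(p) = \min_{y \in \{\tilde y_1, \tilde y_2\}} \{p\, \ell(y_1, y) + (1 - p)\, \ell(y_2, y)\}$.
With $w = 1/m$, we get

$$\mathbb{E} R(\hat g) \ge \min_{g \in \mathcal{G}} R(g) + m w d_I \big(1 - \sqrt{n w d_{II}}\big) = \min_{g \in \mathcal{G}} R(g) + d_I \Big(1 - \sqrt{n d_{II}/m}\Big).$$

From the symmetry and admissibility assumptions of the loss function, we have
$\ell(y_2, \tilde y_1) = \ell(y_1, \tilde y_2) > \ell(y_1, \tilde y_1) = \ell(y_2, \tilde y_2)$,
hence $\delta = \ell(y_1, \tilde y_2) - \ell(y_1, \tilde y_1) > 0$. We obtain

$$\begin{aligned}
d_I &= \psi_{\frac{1 + \sqrt{d_{II}}}{2}, \frac{1 - \sqrt{d_{II}}}{2}, y_1, y_2}(1/2) \\
&= \phi_{y_1, y_2}(1/2) - \frac{1}{2} \phi_{y_1, y_2}\Big(\frac{1 + \sqrt{d_{II}}}{2}\Big) - \frac{1}{2} \phi_{y_1, y_2}\Big(\frac{1 - \sqrt{d_{II}}}{2}\Big) \\
&= \phi_{y_1, y_2}(1/2) - \phi_{y_1, y_2}\Big(\frac{1 + \sqrt{d_{II}}}{2}\Big) \\
&= \frac{1}{2} \ell(y_1, \tilde y_1) + \frac{1}{2} \ell(y_2, \tilde y_1) - \Big(\frac{1 + \sqrt{d_{II}}}{2} \ell(y_1, \tilde y_1) + \frac{1 - \sqrt{d_{II}}}{2} \ell(y_2, \tilde y_1)\Big) \\
&= \frac{\sqrt{d_{II}}}{2}\, \delta.
\end{aligned}$$

The optimization of the lower bound leads us to choose
$d_{II} = \frac{m}{4n} \wedge 1$ and we get the desired result.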






References 



1. J.-Y. Audibert. Fast learning rates in statistical inference through
   aggregation. Research report 06-20, Certis - Ecole des Ponts,
   http://cermics.enpc.fr/~audibert/RR0620d.pdf, 2006.

2. J.-Y. Audibert. A randomized online learning algorithm for better variance
   control. In Proceedings of the 19th annual conference on Computational
   Learning Theory (COLT), Lecture Notes in Computer Science, volume 4005,
   pages 392-407, 2006.

3. A. Barron. Are Bayes rules consistent in information? In T.M. Cover and
   B. Gopinath, editors, Open Problems in Communication and Computation,
   pages 85-91. Springer, 1987.

4. A. Barron and Y. Yang. Information-theoretic determination of minimax rates
   of convergence. Ann. Stat., 27(5):1564-1599, 1999.

5. G. Blanchard. The progressive mixture estimator for regression trees. Ann.
   Inst. Henri Poincaré, Probab. Stat., 35(6):793-820, 1999.

6. F. Bunea and A. Nobel. Sequential procedures for aggregating arbitrary
   estimators of a conditional mean, 2005. Technical report, available from
   http://stat.fsu.edu/~flori/ps/bnapril2005IEEE.pdf.

7. O. Catoni. A mixture approach to universal model selection. Preprint LMENS
   97-30, available from
   http://www.dma.ens.fr/edition/preprints/Index.97.html, 1997.

8. O. Catoni. Universal aggregation rules with exact bias bound. Preprint
   n.510, http://www.proba.jussieu.fr/mathdoc/preprints/index.html#1999, 1999.

9. N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability
   of on-line learning algorithms. IEEE Transactions on Information Theory,
   50(9):2050-2057, 2004.

10. L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern
    Recognition. Springer-Verlag, 1996.

11. D. Haussler, J. Kivinen, and M. K. Warmuth. Sequential prediction of
    individual sequences under general loss functions. IEEE Trans. on
    Information Theory, 44(5):1906-1925, 1998.

12. A. Juditsky, P. Rigollet, and A.B. Tsybakov. Learning by mirror averaging.
    Preprint n.1034, Laboratoire de Probabilités et Modèles Aléatoires,
    Universités Paris 6 and Paris 7, http://arxiv.org/abs/math/0511468, 2006.

13. V.G. Vovk. Aggregating strategies. In COLT '90: Proceedings of the third
    annual workshop on Computational learning theory, pages 371-386, San
    Francisco, CA, USA, 1990. Morgan Kaufmann Publishers Inc.

14. V.G. Vovk. A game of prediction with expert advice. Journal of Computer and
    System Sciences, pages 153-173, 1998.

15. Y. Yang. Combining different procedures for adaptive regression. Journal of
    multivariate analysis, 74:135-161, 2000.

16. T. Zhang. Data dependent concentration bounds for sequential prediction
    algorithms. In Proceedings of the 18th annual conference on Computational
    Learning Theory (COLT), Lecture Notes in Computer Science, pages 173-187,
    2005.






